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Abstract Latent feature models are a powerful tool for 
modeling data with globally-shared features. Nonpara- 
metric exchangeable models such as the Indian Buffet 
Process offer modeling flexibility by letting the num¬ 
ber of latent features be unbounded. However, current 
models impose implicit distributions over the number 
of latent features per data point, and these implicit 
distributions may not match our knowledge about the 
data. In this paper, we demonstrate how the Restricted 
Indian Buffet Process circumvents this restriction, al¬ 
lowing arbitrary distributions over the number of fea¬ 
tures in an observation. We discuss several alternative 
constructions of the model and use the insights gained 
to develop Markov Chain Monte Carlo and variational 
methods for simulation and posterior inference. 

Keywords Bayesian nonparametrics • Latent feature 
models • Indian Buffet Process 


1 Introduction 

Generative models are a popular approach for iden¬ 
tifying latent structure in data. For example, a mu¬ 
sical piece may be naturally modeled as a collection 
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of notes, each with associated frequencies. A patient’s 
health may be naturally modeled as a collection of dis¬ 
eases, each with associated symptoms. The text of a 
news article may be naturally modeled as a collection 
of topics, each with associated words. In each of these 
cases, we posit that there exists a small set of under¬ 
lying features that are responsible for generating the 
structure that we observe in the data. 


When the number of these underlying features is 
unknown, Bayesian nonparametric models such as the 
Indian Buffet Process (IBP) |GrifRths and Ghahramani, 


2011 provide an elegant generative modeling approach. 


Specifically, the IBP posits that there are an infinite 
number of potential underlying features, but only a 
finite number of features underlie any particular ob¬ 
servation. The IBP has been the foundation for a va¬ 
riety of modeling applications including choice behav¬ 
ior Goriir et al., 2006 , psychiatric comorbitities Ruiz 


et al., 2014 , network models Miller et al., 2009 , blind 


source separation 


Knowles and Ghahramani, 2007 


age modeling Zhou et al., 2009 , and time-series mod¬ 


els Fox et al., 2009 


Under the IBP, the prior distribution over the num¬ 
ber of features underlying an observation is governed 
by a single parameter a. The number of features in an 
observation is expected a priori to be distributed as 


Poisson(a). The two-parameter Griffiths and Ghahra¬ 


mani, 2011 and three-parameter Teh and Goriir, 2009 


extensions of the IBP retain this strong requirement 
for Poisson-distributed feature cardinality. Other non¬ 
parametric latent variable models such as the infinite 
gamma-Poisson process [Titsias, 2008 and the beta¬ 


negative Binomial process Broderick et al., 2015 Zhou 


et al., 2012 also exhibit a Poisson distribution over the 


number of non-zero features. Even IBP variants that 
posit various kinds of correlations between observa- 
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tions Gupta et al., 2013 Miller et al., 2008 or features 


Doshi-Velez and Ghahramani, 2009 retain the Poisson 


property on the number of features in each observation. 
[Caron, 2012 somewhat relaxes the Poisson constraint; 
their model allows the number of features underlying 
each observation to follow a mixture of Poissons. 

However, there may be situations in which we do 
not desire Poisson-distributed marginals. For example, 
power law behaviors are common in networks and natu¬ 
ral language. In medicine, the number of patients visit¬ 
ing a clinic without severe illnesses may be much more 
than predicted by a Poisson distribution. When mod¬ 
eling articles, we may wish to preclude the possibility 
of having no topics represented. Image data may come 
with labels, and the text of the label might provide 
strong clues about the number of objects we can expect 
to see in the image. In other settings, we may know ex¬ 
actly the number of active features associated with an 
observation. For example, when modeling audio record¬ 
ings, the number of speakers in each recording might be 
known. The IBP does not provide the flexibility to put 
an arbitrary prior distribution on the number of latent 
features in an observation; even with the mixture of 


Poissons allowed by Caron, 2012 we are constrained 


to overdispersed distributions with full support on the 
non-negative integers. 

In this article, we present and describe the Re¬ 
stricted Indian Buffet Process (R-IBP), a recently de¬ 
veloped model that allows an arbitrary prior distribu¬ 
tion to be placed over the number of features underlying 


each observation. Unlike the model of Caron, 2012 


this distribution can have arbitrary support, or even 
be degenerate on a single value. The R-IBP was origi¬ 


nally presented in Williamson et al., 2013 ; this paper 


extends upon that exposition. We present several alter¬ 
native constructions, new insights, and novel efficient 
inference techniques. 


2 Background: Completely random measures 
and Infinite Exchangeable Matrices 

Many Bayesian nonparametric models, including the 
IBP, can be expressed in terms of completely random 
measures (CRMs) [Kingman, 1 967 . A completely ran¬ 
dom measure /x is a random measure consisting of a 
collection of atoms /x = )Tb 7q Jg.Qon some space (<9, A) 
such that for any disjoint subsets A\, A 2 £ A, Ai(TA 2 = 
0, the masses /x(Ai), /x(A 2 ) assigned to those subsets are 
independent. 


The atom sizes 7and locations 9i and are gov¬ 
erned by a Levy measure h'(dir,d9)', different choices 
of the Levy measure yield different properties. For 
example, the Levy measure v{dr:,d9) = ca 7r _1 (l — 
n) a ~ 1 dTrH(d9) describes the homogeneous beta pro¬ 
cess |Hjort, 1990 , whose name reflects the fact that 
the atom sizes 7q are equal in distribution to the limit 
as I — > oo of Beta (™,c(l— y)) random variables. 
The Levy measure is(dTr,d9) = ^-K~ 1 e~ x ' IT d , nH{d9) de¬ 
scribes the gamma process, whose atom sizes similarly 
correspond to the infinitesimal limit of a gamma dis¬ 
tribution. We will write an arbitrary CRM with Levy 
measure v(dir,d9) as CRM (xx(g?7t, d9)). 

CRMs can be used to construct distributions over 
matrices with exchangeable rows and infinitely many 
columns. To do so, we first define a directing measure 
9 '■= Y^iLi n i^ 0 i ~ CRM (is(dTT, d9)) to be a CRM with 
Levy measure u. We then let := 

CRM (g(n)), n = 1,2,... be a sequence of CRMs whose 
Levy measure is some functional g(/x) of this direct¬ 
ing measure /x. Then, following de Finetti, the sequence 
Ci,C 2 , • ■ • is an infinitely exchangeable sequence of mea¬ 
sures. If we consider only the atom sizes z n i of these 
measures, then we can transform this sequence of ex¬ 
changeable measures into a sequence of exchangeable 
vectors Z n = (z n i, z n 2 , ■ ■ ■)■ Stacking these (infinitely 
long) vectors results in a matrix Z = {Z \ 1 ..., Z n ) with 
exchangeable rows. 

One of the most commonly used models in this 


class is the beta-Bernoulli process Thibaux and Jor¬ 


dan, 2007 , which defines a distribution over infinitely 
exchangeable binary matrices. The directing measure /x 
is distributed according to a beta process 

BP(c, a, H) := CRM (ca7T _1 (l - tt)*- 1 drrH(d9)) , 

where c, a > 0 and H is a probability measure on 0. 
Conditioned on /x, the Q n are distributed according to 
a Bernoulli process 

BeP (/lx) := CRM (<5i(d7r) fr(d9)). 

The number of non-zero entries in each row of the re¬ 
sulting matrix will be finite, but random; marginally, 
this number will be distributed as Poisson(a). 

Since the beta process and the Bernoulli process 
form a conjugate pair, we can integrate out the di¬ 
recting beta process measure and work directly with 
the exchangeable sequence of binary vectors. When 
c = 1, the resulting exchangeable distribution is known 


as the Indian Buffet Process Griffiths and Ghahramani, 


2011 , and the predictive distribution can be described 


1 Technically, a CRM can also include a deterministic, non- 
atomic component; however we ignore this for simplicity. 


in terms of the following analogy: Let each column of 
our matrix correspond to a dish in an infinitely-long 
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buffet, and each row correspond to a customer. The 
first customer selects a Poisson(cn) number of dishes. 
When the nth customer arrives at the buffet, there 
are a finite number of previously sampled dishes and 
an infinite number of unsampled dishes. He selects a 
dish that has previously been sampled m* times with 
probability rrii/n, and selects a Poisson(a/n) number 
of new dishes. For general c ^ 1, the corresponding ex¬ 
changeable process is known as the two-parameter IBP 


Griffiths and Ghahramani, 2011 Thibaux and Jordan 


2007 ; a related restaurant analogy is given in Griffiths 


and Ghahramani, 2011]. 

The Indian Buffet Process can be used as the basis 
for a latent feature model where both the number of 
latent features exhibited by a given data point, and 
the total number of latent features, are unknown. In 
this context, each row of the matrix corresponds to a 
data point, and each column corresponds to a latent 
feature; a non-zero entry indicates that a given data 
point exhibits a given feature. 

Different choices of CRMs yield different proper¬ 
ties in the resulting matrix. For example, the three- 
parameter Indian Buffet Process replaces the beta pro¬ 
cess directing measure with a stable-beta process; the 
resulting random matrix exhibits power-law behavior in 


the total number of features exhibited in N rows Teh 


and Goriir, 2009 . If we combine a gamma process di¬ 


recting measure with a sequence of Poisson processes, 


we obtain the infinite gamma-Poisson process Tit- 


sias, 2008 , a distribution over integer-valued matrices. 


Other exchangeable matrices constructed in this man¬ 


ner include the beta-negative binomial process Zhou 


et ah, 2012 Broderick et ah, 2015 and the gamma- 


exponential process Saeedi and Bouchard-Cote, 2011 . 


3 Exchangeable Binary Matrices with 
Arbitrary Marginals: The Restricted Indian 
Buffet Process 

The class of exchangeable matrices described in Sec¬ 
tion [2] offers significant modeling flexibility. One prop¬ 
erty, however, cannot be avoided by judicious choice of 
CRM: the distribution over the number of non-zero en¬ 
tries per row is always marginally Poisson. This prop¬ 
erty is a direct consequence of the complete random¬ 
ness of the underlying random measures /i and ( n . To 
show this property, we observe that, regardless of choice 
of directing measure, there will be some probability 7q 
that the column i is non-zero. Each column i is cho¬ 
sen independently, resulting in a binomial distribution 
over the number non-zeros entries per row. With infi¬ 
nite columns, the binomial distribution converges to a 
Poisson distribution. 


We can also show, intuitively, how imposing an ar¬ 
bitrary distribution over the number of non-zero entries 
must break the complete randomness. Suppose that we 
know that each row of our matrix has exactly J non¬ 
zero entries. Next, suppose that we observe J non-zero 
entries in the first k columns. We know that the re¬ 
maining (infinite) entries must be zero with probability 
one; the probabilities of the entries in the disjoint sets 
of columns 1... k and (k + 1)... are no longer indepen¬ 
dent. Complete randomness has been broken. 

The Restricted Indian Buffet Process (R-IBP), first 
introduced in Williamson et ah, 2013], is a distribu¬ 
tion over exchangeable binary matrices with an arbi¬ 
trary distribution over the number of non-zero entries 
per row. In the following sections, we describe several 
equivalent formulations for the R-IBP. While the focus 
is on restricted versions of the Indian Buffet Process, 
the ideas in this section can be similarly applied to 
build other matrices with arbitrary marginals, as we 
will describe in Section [4j 


3.1 Construction of the R-IBP via Restriction in the 
de Finetti Representation 


The R-IBP was originally constructed (in Williamson| 
et al., 2013 ) by manipulating the beta-Bernoulli pro¬ 
cess representation of the IBP. Recall from Section [2] 
that we can represent the IBP as a mixture of Bernoulli 
processes, directed by a beta process: 


:= ^2 Me, ~BP(c, a, H) 


Cn := J2 ; 


i '~'BeP(/i) 


(1) 


Zn : = {Zni)°Z 1- 


Since we are not interested in the locations 6i of the 
atoms, we will employ a slight misuse of notation and 
write Equation |T] as: 

/r ~BP(c, a , H) 

( 2 ) 

Z n ~ BeP Qj). 


We can modify this construction to give a re¬ 
stricted model where the number of non-zero entries 
per row is constrained to be some integer J, by replac¬ 
ing the Bernoulli process in Equation [2] with a restricted 
Bernoulli process 


R-BeP (Z n ;n,f 


BeP 

0 


if Ei z m = J 

otherwise. 


( 3 ) 
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where the associated normalizing constant is propor¬ 
tional to the probability that a random sample from 
a Bernoulli process has total mass J. More concretely, 
this gives 


R-BeP (Z n ; /U, / = Sj) 

T, Z 'ez IL ^ (i - A = J) 


where Z is the support of BeP(/i). 

This restricted Bernoulli process is the random mea¬ 
sure obtained by conditioning the Bernoulli process on 
its total sum; it can be seen as a nonparametric ex¬ 


tension of the conditional Bernoulli distribution Chen. 


2000 . Clearly it is no longer a completely random mea¬ 


sure: disjoint subsets of Z n depend on each other via 
the total sum. 

More generally, we may wish to have some arbitrary 
distribution / on the number of non-zero entries per 
row. We can obtain this by creating an /-mixture of 
the distributions described by Equation |4j so that the 
probability of a vector Z is given by 


R-BeP(Z; = f Zij R-BeP (Z; //, 8^ z,) ■ 

(5) 

We can substitute these restricted Bernoulli pro¬ 
cesses (Equations [4j [5]) for the Bernoulli processes in 
Equation [2j yielding the following Restricted Indian 
Buffet Process: 

fj, ~BP(c, a, H) 

iid ( 6 ) 

Z n ~ R-BeP(/x, /). 

Since the Z n are identically and independently dis¬ 
tributed given n, de Finetti’s theorem tells us the re¬ 
sulting matrix Z = (Z n )^ =1 has exchangeable rows. 

We note that even if we choose f{z) = Poisson(z; a), 
we do not recover the IBP. The IBP has Poisson(a) 
marginals over the number of non-zero elements in each 
row; however, conditioned on observing some elements 
in a row, the number of non-zero entries in the remain¬ 
ing elements are distributed according to a Poisson- 
binomial distribution. Complete randomness requires 
that distribution over the non-zero elements in some 
subset of columns does not depend on the number of 
non-zero elements in a disjoint subset of columns. In 
contrast, an R-IBP with f(z) = Poisson(,z; a) will re¬ 
tain Poisson(a) as the conditional distribution over the 
total number of non-zero entries, even after some entries 
have been observed. 


3.2 Construction via Subsets of an Exchangeable 
Sequence 


In Section |3.1[ we saw how the R-IBP can be rep¬ 
resented using the combination of a beta process di¬ 
recting measure and a sequence of restricted Bernoulli 
processes parametrized by this measure. Sometimes 
it is more convenient to work solely in terms of the 
exchangeable matrix Z (which has a finite number 
of non-zero columns), integrating out the (infinite¬ 
dimensional) directing measure /i. We can make use 
of the IBP predictive distribution to represent the R- 
IBP without representing the underlying beta process; 
however care must be taken to ensure the correct dis¬ 
tribution. 

We can generate an IBP-distributed sequence Z* = 
(Z*, Z 2 , ■ ■ ■) of vectors using the buffet-based predic¬ 
tive distribution described in Section [2j Since this se¬ 
quence is infinitely exchangeable, its law is invariant to 


shuffling the order of any finite subset Aldous, 1983 


A direct consequence of this is that any infinite sub¬ 
sequence Z of Z* is again infinitely exchangeable. Thus, 
we can construct an R-IBP(c, a,//distributed matrix 
Z by sampling a sequence of vectors Z* ~ IBP(c, a), 
and including each proposed vector Z* into our matrix 
Z with probability f{J2i Z ni)- 

We note that this is directly equivalent to the re¬ 
stricted Bernoulli process method described in Section 
3.1 if we integrate out the directing measure, a se¬ 
quence of Bernoulli process-distributed measures is de¬ 
scribed by the IBP. However, an undesirable property 
of the IBP-based procedure is that, unlike the Bernoulli 
process-based procedure, one must retain the entire se¬ 
quence Z* (or at least, its sufficient statistics) to gen¬ 
erate the next candidate for Z. If we generate our pro¬ 
posed distributions based on the column counts of Z 
rather than Z*, the resulting matrix will not have the 
desired law - and in general will not even be exchange¬ 
able. 

To demonstrate this lack of exchangeability, we will 
attempt to construct a R-IBP(c = 1, a, f = e)-|) matrix 
Z by generating candidate vectors for Z n based only on 


the counts of Zi :n _i. As shown by Fortini et al., 2000 


and Aldous, 1983 , a sequence is infinitely exchangeable 
iff 


(Zn+i,Z n+2 )\(Z u ..., Zn) = ( Zn+ 2 , Ev+i)l(-^i ... Zn). 


It therefore suffices to check whether (Z 2 , Z 3 )\Zi = 
(Z 3 , Z 2 )\Zi. Let P be the law of the IBP with pa¬ 
rameters l,a, and let P* be the law of the proposed 
variant. Since our restricting function f = S i, triv¬ 
ially we have P*(Z\ = (1,0,0,...)) = 1. We will 
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compare P*(Z 2 = (1,0,0,...), Z 3 = (0,1,0,...)) and 
P*(Z 2 = (0,1,0,...),Z 3 = (1,0,0,...)). 

Under the Indian Buffet Process, we have P(Z 2 = 
(1,0,0,...)|Z! = (1,0,0,...)) = ie -“/ 2 and P{Z 2 = 
( 0 , 1 , 0 ,...)!^ = ( 1 , 0 , 0 ,...)) = fe-“/ 2 ; there¬ 

fore if we restrict Z 2 to these two cases, P*{Z 2 = 
(1,0,0,...)) = 5 ^ and P*(Z 2 = ( 0,1,0,...)) = 5 %. 
Following a similar argument, 

P*(Z 3 = (0,1,0,...)|Z 1 = (1,0,0,...),Z 2 = (1,0,0,...)) 
a 

a + 6 

and 

P*((Z 3 = (1,0,0,... )\Z X = (1,0,0,...), Z 2 = (0,1,0,...)) 
3 

6 + 2 a 
So, 

P*(Z 2 = (1,0,0,...), Z 3 = (0,1,0,...)) 

2 a 2 a 

2 + a a + 6 a 2 + 8 a + 12 

and 

P*(Z 2 = (0,1,0,...), Z 3 = (1, 0,0,...)) 

a 3 3a 

~~ 2 + a 6 + 2 a ~ 2 a 2 + 10 a+ 12 ' 

d 

Clearly, (Z 2 ,Z 3 )\Zi ^ (Z 3 ,Z 2 )\Zi under the pro¬ 
posed construction, meaning the resulting sequence is 
not exchangeable. In order to construct an exchange¬ 
able sequence via the IBP, we must record the entire 
IBP-generated sequence and then select an appropriate 
sub-sequence. 


3.3 Construction via Tilting the Bernoulli Process 


A tilted CRM /i* is a random measure obtained by 
scaling the law of a CRM /i on {D,A) by its total 
mass, according to some function /i(f?) [Lau, 2013 
that 


so 


PA*) ■■= 


i 


nhAm 


h(v(ft))P^(dv). 


(7) 


For example, if h(x) = e~ JX , then /i* is said to be ex¬ 
ponentially tilted. Exponentially tilting a CRM yields 
a different CRM Lau, 2013 ; for example an exponen¬ 
tially tilted a-stable process is equal (in distribution) 
to a generalized gamma process Brix, 1999 . In gen¬ 
eral, however, a tilted CRM will not be a completely 
random measure. For example, if h(x) = x~ q for some 
q > 0, then \i* is said to be polynomially tilted and is no 


longer a CRM. Random measures constructed via poly¬ 


nomial tilting include the Pitman-Yor process Pitman 


and Yor, 1997] (obtained by polynomially tilting an a- 
stable process) and the beta-gamma process [James 


2005] (obtained by polynomially tilting a gamma pro¬ 


cess). 

In Equation [5] the probability of a vector Z n under 
the restricted Bernoulli process is given by its probabil¬ 
ity under the Bernoulli process, scaled by a function / 
of the number of nonzero entries in Z n (or equivalently, 
the total mass of the corresponding random measure 
( n = Thus the restricted Bernoulli process 

can be described as a tilted Bernoulli proces^with the 
tilting function h{x) = /(: x). 


3.4 Construction via the Normalized Beta Prime 
Process and Invariance with respect to the Directing 
Measure 


As shown in Equation [2] the IBP can be written as 
a sequence of Bernoulli processes with a beta pro¬ 
cess directing measure /i. If only a finite number N 
of rows Z n have been observed, our uncertainty about 
H is described by a beta process with parameters 
C + N, J2n= 1 Cn- As N tends to infinity, 

this posterior will tend towards the uniquely defined 
directing measure /r. 

In contrast, the beta process directing measure /i 
for the R-IBP can never be uniquely determined, even 
with infinitely many observations. To show this, we 
can re-construct the R-IBP in terms of a beta-prime 
process Broderick et ah, 2014 . A beta-prime process- 
distributed CRM r := wycty is obtained by trans¬ 
forming the atoms 7r^ of a beta process-distributed 
CRM n := 7T iSei according to 


The de Finetti representation of the R-IBP can now be 
written as 


t := Wi5g i ~Beta-prime(c, a, H ) 

Jn ( 8 ) 

P(Z„\T,f) = ni<‘I(Ei*ni = J) . 

J2z'GzY[i W i'^(J2i Z ni ~ J) 

2 Arguably, the tilted Bernoulli process nomenclature is 
perhaps a better fit for the R-IBP, since for arbitrary / the 
“restricted Bernoulli process” is in fact a mixture of restricted 
distributions. However, the tilting interpretation was not ap¬ 
parent when the models described in this paper were first 
introduced in W illiamson et al., 2013] , so we continue to use 
original term “restricted” for consistency. 



































6 


Finale Doshi-Velez, Sinead A. Williamson 


The law P(Z n \r, f) is invariant to rescaling the Wi by 
some constant , that is, 

R-BeP(Z; {wj, J) = R-BeP(Z; {e^J, J) 


for any j3 € R. Rescaling the beta-prime process weights 
Wi by is equivalent to rescaling the atoms 7q of the 
corresponding beta process according to the nonlinear 
function 


1,1 71"^ + 1 - 7Tj ’ ^ 

which describes the Esscher transform of a Bernoulli 
random variable Gerber and Shiu, 1993 . Intuitively, 
this scale invariance occurs because the R-IBP first 
chooses the number of non-zero entries J n ~ / and then 
selects which entries will be non-zeros. Conditioned on 
J„, the absolute scale of the weights 7r no longer mat¬ 
ters; only their relative sizes are important. 

The connection between the restricted IBP and the 
beta-prime process makes it possible to remove extra 
degree of freedom present in the beta-Bernoulli (or 
beta-prime-Bernoulli) construction by fixing the scale 
through a normalized beta-prime process. While this 
is theoretically appealing - it leads to a unique direct¬ 
ing measure for each infinite sequence - it offers little 
practical advantage, due to the lack of a tractable rep¬ 
resentation for such a process. 


4 Extensions and Variations 

In Section [3j we focused on exchangeable models based 
on the IBP. However, the same ideas apply to exchange¬ 
able models based on other completely random mea¬ 
sures. One can also relax the exchangeability assump¬ 
tion to allow partial exchangeability, leading to models 
appropriate for data with observation-specific covari¬ 
ates. 


4.1 Restricted Exchangeable Matrices based on 
Different Completely Random Measures 

In Section lXTl we showed that the Restricted IBP can be 
constructed by starting from the beta-Bernoulli process 
representation of the IBP, and replacing the Bernoulli 
process with a restricted Bernoulli process. Rather than 
start from the beta-Bernoulli process, we could pick any 
pair of CRMs to generate an exchangeable sequence 
(Cn)n=ii provided we can parametrize the ( n using the 
directing measure p, for example if p and f n form a 
conjugate pair [Orbanz, 2009 : 

/j, ~CRM (u(dn, d6)) 

( 10 ) 

Cn ~CBM(s(/i)),n=l,2,... 


If the support of the random measures in Equa¬ 
tion [To] consists almost surely of measures with a finite 
number of non-zero atoms, the resulting exchangeable 
sequence can be interpreted as a row-exchangeable ma¬ 
trix with a finite number of non-zero columns. Exam¬ 
ples of such exchangeable matrices include the beta¬ 


negative binomial process Zhou et ah, 2012 Broder¬ 


ick et ah, 2015 and the gamma-Poisson process 


sias, 2008 


Tit- 


. All such models exhibit the property that 
the total number of non-zero elements of a row are 
(marginally) Poisson-distributed, a direct consequence 
of the complete randomness of the underlying random 
measures (as described in Section [3]). 

For any such exchangeable model, we can restrict 
the support of the to generate an exchangeable 
matrix with restricted support. If C, n ~ BeP(/z), this 
amounts to placing some distribution / over the num¬ 
ber of non-zero entries, as described in Section |3.1[ For 
other choices of CRM, however, the support of the unre¬ 
stricted measure will not be limited to binary vectors, 
presenting a wider range of possible restrictions. We 
discuss three possibilities below. 


Restricting the number of non-zero entries per row If 
is distributed according to a Bernoulli process, im¬ 
posing a distribution over the sum of a row is equiv¬ 
alent to imposing a distribution over the number of 
non-zero entries. For more general CRMs, these two 
cases are not the same. We first consider imposing a 
function / (]C fe I (zk > 0)) on the number of non-zero 
entries. This yields a restricted CRM with law 


R-CRM« [z-,g(p),f >0) 

«/ > °)j CRM (Z;g(p)) 


( 11 ) 


where CRM(g(p)) is the law of the corresponding un¬ 
restricted CRM. 


Restricting the sum of each row We can also impose 
a function /(XOt z k) on the total sum of each row (or 
equivalently the total mass of each measure £„), yield¬ 
ing 


R-CRMO [z-g(p),f 





CRM (Z-g(p)) 


( 12 ) 


A special case of the construction in Equation [12] is 
obtained when the directing measure p is distributed 
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according to a gamma process with parameter aH for 
some probability measure H , and the are distributed 
according to a Poisson process with mean measure fj, 
Titsias, 20081. In this case, if we restrict the total 
sum of each row following Equation [12} the distribution 
over the C, n is equivalent to that given by the following 
Dirichlet process-multinomial model: 


task with labeling information indicating the expected 
number of features per observation - for example, in 
image modeling we might have descriptions or low-level 
labeling. 

5 Simulation from the R-IBP 


p ~DP(cc, H) 

Jn~f 

Cn ~Mult(p, J n ) 

Restricting the sum of each row and the value of each 
element The examples in Equations m and [12] give a 
taste of the sort of distributions available under this 
construction. We can also specify more complex restric¬ 
tions. For example, we could generate an exchangeable 
binary matrix with J non-zero elements by taking let¬ 
ting the Cn be a CRM with integer-valued atoms, and 
restricting both the number of non-zero elements to be 
J and the values of the non-zero elements to be one. 
If the directing measure /t is distributed according to 
a gamma process, and the are distributed accord¬ 
ing to a Poisson process, each row corresponds to sam¬ 
pling J entries using conditional Poisson sampling from 
a Dirichlet-distributed random measure. 


4.2 Restricted Partially-Exchangeable Matrices 

In Section [3j we assumed that our data points (or 
equivalently, the rows of our matrix) are exchangeable. 
We can however modify the R-IBP to yield a par¬ 
tially exchangeable model appropriate for data with 
observation-specific covariates. For example, each ob¬ 
servation might have an associated label indicating a 
group membership m £ {1 ,...,M}, and each group 
could have group-specific restricting distribution f m , so 
that 

^ ~BP(c, a, H) 

Z n i ~ d 'R-BeP(/U, 

where m(n ) is covariate describing the group for obser¬ 
vation n. The resulting matrix would be partially ex¬ 
changeable in the sense that the distribution is invariant 
to permuting rows belonging to the same group. 

This model would be appropriate where we have 
observation-specific information about the number of 
non-zero features. For example, we might wish to con¬ 
struct a topic model with different distributions over 
the number of topics depending on the type of docu¬ 
ment (novels contain many topics, news articles contain 
fewer topics). Or, we might have a feature extraction 


In Section [3j we presented several constructions for the 
R-IBP. These constructions result in a variety of ap¬ 
proaches for sampling from the R-IBP prior. In this 
section, we discuss exact and approximate approaches 
for sampling from the R-IBP. 


5.1 Sub-sampling from an Exchangeable Model 

In Section |3.2[ we showed that the R-IBP can be 
constructed by subset selection of an IBP-distributed 
sequence of binary vectors. This directly suggests a 
scheme for generating exact from the prior. We gen¬ 
erate a sequence Z* = ( Z*) according to the Indian 
Buffet Process predictive distribution^] 


n —1 

m ni = 'y ( Z n j 

j= 1 

K+ = y^I(m ni > 0) 

k 

~Bernoulli(m n i /n) for i = l,..., K+ 

\ n ~Poisson(a/n) 

Znj =1 for 3 = K n + 1> • • ■ , K n + A 

Then, we include each Z* = (^* 1; z* 2 ;...) in our se¬ 
quence Z with probability P(Z* £ Z) = f{Y^i z ni)- 
Importantly, while not all the generated binary vectors 
.Z* are included in Z, they are all included in the counts 
77i n j, ensuring exchangeability is maintained. Rejection 
sampling in an exchangeable model produces perfect 
samples from the R-IBP, but can suffer from a low ac¬ 
ceptance rate. 


5.2 Approximate Sampling with a Conditionally 
Independent Model 


An alternative approach, inspired by the construction 
in Section |3.1[ is to explicitly sample the directing 
measure /r ~ BP(c, a,H) (or, alternatively, sample 


3 Here we use the one-parameter version, i.e. c = 1. We 
could easily substitute the two-parameter predictive distri¬ 


bution; see 


Griffiths and Ghahramani, 2011 
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r ~ Beta-prime(c, a, H)), and use this to sample a se¬ 
quence Z* = (Z*) of binary vectors. Practically speak¬ 
ing, we cannot represent the entire infinite-dimensional 
measure p (or r). However, we can work with finite¬ 
dimensional approximations to p to produce both ap¬ 
proximate and exact samples. We describe approximate 
approaches in this section and an exact approach in sec¬ 
tion [531 

We first need to produce a finite set of weights n 
that well approximate the infinite-dimensional measure 
p. We consider two options: 

— Weak Limit One approach is to use a finite vector 
of beta random variables that converges (in a weak 
limit sense) to the beta process [Zhou et al., 2009J, 
approximating /i with a vector j r = (fri,..., 7iy), 
where 

. ad fca ca\ 

TTt ~ Beta(^ — ,c- — J . (13) 


— Size-Ordered Stick-breaking Representation Another 
approach is to transform the arrival times of a unit- 
rate Poisson process based on the beta process Levy 
measure [Rosiski, 2001, iFerguson and Klass, 1972 


This approach gives exact samples from the size- 
ordered atoms of the beta process. In the special 
case where c = 1 , this yields a simple stick-breaking 
construction [Teh et al., 2007 : 


Ui ~Beta(a, 1) 

i 

71 'i 

1=1 


(14) 


If we let our truncated approximation tt) = 77 for 
i = 1... I and 7 r* = 0 for i > I, we obtain an ap¬ 
proximation to a sample from a beta process. 


Using a tilted Bernoulli process proposal The rejection 
sampling procedure using a Bernoulli process proposal 
will give low acceptance rates—and therefore high com¬ 
putational cost- if the target distribution / differs sig¬ 
nificantly from the Poisson(ct) distribution implied by 
the IBP. We can improve the acceptance rate—and 
hence ameliorate the computational costs—by expo¬ 
nentially tilting the Bernoulli process likelihood, as de¬ 
scribed in Section I7TTT1 

If we tilt a Bernoulli process (or, equivalently, scale 
the beta process-distributed directing measure accord¬ 
ing to Equation [9] and use the scaled directing measure 
as the base measure for a Bernoulli process), we change 
the distribution over the number of non-negative entries 
Brostrom and Nilsson, 2000 . If we restrict the tilted 
Bernoulli process, however, the distribution is not af¬ 
fected by the tilting parameter /?, i.e. 


R-BeP((7ri,... , 7 r/)) 

e^fri, e^TTi, 

e^7Tl + 1 — 7Tl ’ e^TTj + 1 — 7Tj 

We can maximize the likelihood of getting exactly J 
non-negative entries, by setting /3 to be the unique so¬ 
lution to 



= R-BeP 


J = 


e^7T; 


I 

V_ 

—f e^7 Ti + 1 - 7fj 
1=1 


Thus, we can first sample the number of features J n 
on the nth row from /, Esscher transform the weights 
77 to maximize the chance of getting exactly J n non¬ 
zero entries, and then sample Z n using the transformed 
weights. For computational efficiency, the transformed 
weights can be cached for each value of J n . 


Given an approximate sample 7r = ( 77 ,..., 717) from our 
directing measure, there are a number of methods to 
simulate Z n ~ BeP(7r). We discuss several approaches 
below. 

5.2.1 Rejection Sampling 

Using a Bernoulli Process Proposal Conditioned on 
7 t, it is straightforward to sample binary vectors Z* 
according to a Bernoulli process, by sampling z* ~ 
Bernoulli ^),i = 1We can use these binary 
vectors as proposals in a rejection sampler. If / is the 
desired distribution over the number of non-zero en¬ 
tries per row, we accept a proposal Z* with probability 
Because we have explicitly instantiated (an 
approximation to) the directing measure p, the rows of 
Z are i.i.cl. and we do not need to maintain the sufficient 
statistics of the rejected binary vectors. 


Discussion of approximation quality As / —> 00 , both 
the weak-limit approximation of Equation |T3] and the 
stick-breaking construction of Equation [14] will give ex¬ 
act samples from the R-IBP. However, a finite I will 
introduce errors. When a stick-breaking representation 
for p is used, then we know that all weights 7ry, j > I 
will be less than 7 iy. In particular, the iterative nature 
of the stick-breaking construction means that, if we ex¬ 
clude the first I atoms 7Ti,..., 717 , and scale the remain¬ 
ing atoms by 7iy, we are left with a (strictly ordered) 
sample from the beta process. 

We can consider the error introduced by this con¬ 
struction by considering the values of z n j that are ex¬ 
cluded due to the truncation. If there are any non¬ 
zero elements z n j for j > /, our rejection proba¬ 
bility will not be correct. Since the weights 77 - . j > 
I are described by a scaled beta process, we know 
that the number of excluded non-zero elements will 
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be distributed as Poisson (0:717). So, with probability 
1 — Poisson( 0 ; 0717) = 1 — exp(—717a) the true sum 
£r z n i £[ z n i and thus we may incorrectly reject 
or accept a proposal.. 

We will reject incorrectly if there are any non-zero 
elements z n j for j > I. If we let fi = 7r iSg i , where 

the 7rj are strictly size-ordered, and sample Z* from the 
Bernoulli process BeP(/r), it follows that the distribu¬ 
tion over the number of non-zero elements for z n j ,j>I 
is given by EZi z ni ~ Poisson(o7T/). The probability 
that all elements are zero is given by P{EZi z ni = 0) = 
exp(— nja). Thus, with probability 1 — exp(—717a), the 
true sum £°° z n i ^ £[ z n i and thus we may reject 
incorrectly. 

Specifically, in the case where f = Sj, three pos¬ 
sible outcomes exist when we propose a binary vec¬ 
tor Z* from a size-ordered truncated approximation 
BeP((7Ti,... ,7r/)): 

1. E\=i z * > J'- We reject the proposal. This is always 
correct. 

2. Ei=i z i = J- We accept the proposal. However, if 
the truncated tail has £°£ +1 z* > 0, we should re¬ 
ally have rejected. Our decision is correct with prob¬ 
ability P{EZi+i z* = 0) = exp(—Tiya). 

3. Ei=i z i < J■ We reject the proposal. However, 
if Ei=i z i = J ~ k but the truncated tail has 
YhLi+i z i = k, we will really should have ac¬ 
cepted. Our decision is correct with probability 
1 - p CEZi +1 Z *= J - EL i z *) = 1 ~ Poisson (J - 

ELi z *'^i a )- 

We will use this enumeration to construct an exact sam¬ 
pler in section [ 53 ] 


5.2.2 Sampling using Inclusion Probabilities 


Even with tilting, rejection sampling can be com¬ 
putationally expensive if / differs significantly from 
Poisson(o). Given a finite-dimensional approximation 
to ff to the directing measure, an alternative is to use 
a draw-by-draw procedure based on computing the in¬ 


clusion probabilities P( 2 n i|ff, k n ) of each feature Aires, 
4 The marginal inclusion probability r]k; j that fea¬ 


1999 


ture k is included in a sample of size J is given by 


Vk;J = 71V 


S j_ / (TTl, ■ • ■, fife —1, ffk_|_l, ..., ft/) 
Sj (ff i, • • •, ft/) 


(15) 


where Sj corresponds to the probability of sampling J 
elements from the set of I features if each feature was 


4 More generally, Hanif and Brewer, 1983 lists over 50 
ways to sample without replacement with unequal weights in 
the finite case. 


chosen independently with probability ft,: 

S J = EseAj(i) TLes ^ Yljssi 1 ~ ( 16 ) 

where Aj(I) is the set of all samples of size J that can 
be drawn from the elements I. Fortunately, there is a 
recursion for calculating the values Sj can be computed 
in 0 (/ 2 )-time: 

Sj = TT I SjZ\(tT 1 , ...,7 fj_i) + (1 - ff/) Sj ~ 1 (ff 1 , ..., ff /—1 ) 

(17) 

and thus with appropriate caching, all of the elements 
rjk. j can be computed (and cached) in 0 (/ 3 ) time, and 
any Esscher transform (Equation [9]) of the ff & can be 
used in the recursions above. 

Given the marginal inclusion probabilities 77 //., we 
now have a draw-by-draw algorithm for sampling a row 
Z n from the prior: 

1. Sample the total number of features J n ~ /. 

2. Set J = J n . 

3. For each feature k £ {1,..., I}: 

(a) Sample z n k ~ Bernoulli^,/). 

(b) If z n k = 1, then decrement J •<— J — 1. 


Discussion of approximation quality If a size-ordered 
stick-breaking representation is used to approximate 
the weights 7 r, then we can directly bound the errors on 
the inclusion probabilities as functions of the trunca¬ 
tion level /, the size of the smallest instantiated weight 
7 t/, and the function /. To do so, we first expand the 
expression for the probabilities Sff , starting with equa¬ 
tion [iGl 

st = E IMa 

s€Aj(I)k€s j3s 

+ n^na 3 ) 

s3Aj(I)k£s j3s 

= exp(-7T/o) n ( 1_7r f) 

sGAj(/) /cES j^S,j<I 

+ Yi n^iK 1 -^) 

sBAj(I) k£s j3s 

= exp(-7r / a)S'5 + ^ 

s 3 Aj(I)k£s j3s 

where Aj(I) are still the sets in which all J non-zero 
entries occur in the first / columns. The second line 
follows because the probability of that all the columns 
j > I are zero is exp(— 717 a). 

Since the probability of at least on non-zero element 
in columns j > I is 1 — exp (—717 a), the second term is 
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bounded between 0 and 1 — exp(— 7rja). Thus we can 
bound the inclusion probabilities 


i) k-j — t ifc- 


Sj—l(7Tl, TTk—l , TTfc+lj 7Tj) 


> 7Tfe 


rj) 

... )7rfc lj7r;c+lj ...,7Tj) 


< TTfe- 


e -^i a S j{tti 1 ..., 7T/) + (1 - e _7r/a ) 

^TTjggJ-l 7rfc _ 1) 7 i- fc+1] 7 Tj) + (1 - e-* ia ) 

e - n i a S I j(ir 1 , ...,7Tj) 

As expected, the quality of the approximation depends 
not only truncation I (and associated nj) but also on 
the values Sj. If the probability of sampling J elements 
from the first I is low, then the approximation will be 
poor because it is likely that additional columns would 
have been required to sample J elements. 


5.3 Exact Sampling in a Conditionally Independent 
Model 


We consider rejection sampling in an R-IBP with re¬ 
stricting function / = Sj, where we accept or reject pro¬ 
posals Z* ~ BeP((7Ti,... , 7 r/)). In Section 5.2.1 work¬ 
ing with a truncated version of /i obtained using a stick¬ 
breaking representation means that some proposals are 
erroneously rejected or accepted, due to the absence 
or presence of non-zero elements below the truncation. 
In particular, there were two cases in which we could 
make mistakes: if Z* l{ < J, we might reject incor¬ 
rectly; if %ni = we might accept incorrectly. To 
circumvent the uncertainty in these outcomes, we use 
a dynamic truncation to obtain exact samples from the 


R-IBP by using retrospective sampling Papaspiliopou- 


los and Roberts, 2008 


— Sample an initial truncated directing measure 
7T!... 7 tj according to the size-ordered stick breaking 
representation of the Beta process (Equation [l4]) . 

— For n = 1,..., N, repeat the following proposal step 
until we have accepted a row Z n : 

— Sample z* ... z} ~ 7Ti... 7T/, and compute the 
sum K* = E[=i z*. 

• If K* > J, reject Z*. 

• If K* = J, accept with probability 
exp(—7rja). 

• If K* < J, 

■ Reject with probability 1 — Poisson( J — 

j Znii 7T/O:). 

• Otherwise, expand the representation 
by sampling new 7q, z* for i = I + 1, 1 + 
2,... according to the stick breaking 
representation, until ^ z* = J. Accept 
the resulting Z *, and update I and 7r to 
incorporate the new atoms. 


We can adapt this procedure to arbitrary restricting 
function /, by first sampling a row count J n ~ / for 
each row. The growth of the truncation level / will de¬ 
pend on /; if J n ~ / is large then we may have to 
expand to very large truncation levels I. Specifically, 
starting with a truncation level too small may result 
in many, many rejections before the truncation level is 
sufficiently expanded. However, the samples that we do 
accept will be from the correct R-IBP prior. 


5.4 Empirical Comparison of Simulation Methods 


We empirically compared the simulation approaches 
from Sections [5~H|5. 2.1 5.2. 2[ and A3 by measuring the 
number of rejections and CPU time required to generate 
samples from the R-IBP with concentration parameter 
a = 5 and restricting function f = Sj for J = {2, 5, 8}. 
We generated 25 samples of 100 observations from each 
of the five approaches: exact collapsed rejection sam¬ 
pling (Section 5.1), approximate uncollapsed rejection 
sampling and tilted approximate uncollapsed rejection 
sampling (Section 5.2.1), approximate inclusion sam¬ 
pling (Section 5.2.2), and exact uncollapsed rejection 
sampling (Section 5.3). 

Rejections per 100 observations are shown in fig¬ 
ure [TJ As expected, rejection rates are lowest for J = 5 
because a = 5. Inclusion sampling, a draw-by-draw 
procedure, has no rejections, and tilting significantly 
reduces the number of rejections—and the variance in 
the number of rejections—regardless of J. The other 
procedures all have large rejection rates varying over 
several orders of magnitude. Figure [2] shows CPU time 
on a standard laptop. Again, the time to 100 accep¬ 
tances is shortest when J is equal to the expected value 
of features a. The approximate methods are faster than 
the exact methods, and the approximate tilted rejection 
sampler is the fastest, closely followed by the approxi¬ 
mate sampler that uses inclusion probabilities]^] 

Figures [3] and [4] show the mean of the empirical fea¬ 
ture probabilities, sorted in descending order, for vari¬ 
ous truncation levels for J = 5 and J = 8. When J = 5, 
the exact samplers instantiate between 30-40 hidden 
features. The mean probabilities of the approximate 
methods follow the exact probabilities relatively closely 
even with truncations of I = 10 or / = 20, with only 
slight over-estimation to account for the fewer features. 


5 The wall-clock time difference between the draw-by-draw 
procedure using inclusion probabilities and the approximate 
rejection samplers may be due in part due to Matlab vector- 
ization; a draw-by-draw procedure requires a loop to sequen¬ 
tially compute whether a feature is present while the rejection 
sampler can sample all elements of Z n together. 
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Fig. 1: Rejections per 100 Acceptances 


When J = 8, the exact methods tend to instantiate 35- 
45 features. The approximate methods have a notice¬ 
able over-estimation of feature probabilities when the 
truncation / is too small (e.g. I = 10). However, as the 
truncation is increased, the mean probabilities from the 
approximate methods again closely match those from 
the exact methods. Interestingly, there do not seem 
to be large differences between the different approxi¬ 
mate methods. These explorations suggest that the ap¬ 
proximate methods can be accurate, computationally- 
efficient alternatives when the truncation is set to a 
reasonable value. 


6 Posterior Inference in the R-IBP 

In this section, we present approaches for posterior in¬ 
ference in the R-IBP. In Section [6T] we present MCMC- 
based approaches related to the simulation techniques 
described in Section [5j and in Section [R2| we present a 
computationally faster hybrid variational/MCMC ap¬ 
proach for posterior inference. 


6.1 MCMC-based Posterior Inference in the R-IBP 

6.1.1 Collapsed Inference using an Augmented 
Representation 

In Section [3~2j we showed that the R-IBP can be con¬ 
structed by selecting subsets of an IBP, and in Sec- 
tion l5.1l we showed that this construction can be used to 
generate samples from the R-IBP prior. We can also use 
this construction to construct a collapsed Gibbs sam¬ 
pler, by reintroducing the discarded rows as auxiliary 
variables. For each data point let Z n be the asso¬ 
ciated latent R-IBP-distributed binary representation, 
let t n be an auxiliary variable indicating the number of 
discarded rows between observations n — 1 and n, and 
let c n be an aligned auxiliary vector of counts associated 
with these discarded rows. Let m* = + c ni ) 

be the total observed and auxiliary counts for the itli 
feature. 

Sampling t n and c n When selecting a subset of the 
IBP, t n is the number of discarded samples between the 
n — 1st and the nth accepted samples, and c n is the as¬ 
sociated column counts. We can sample these directly, 
by sampling vectors Z* from the prior predictive distri¬ 
bution of the IBP given the remaining counts, m + c_ n . 
With probability 1 — f(Z *), we include Z* in the auxil¬ 
iary counts c n and f ra , and sample another vector; with 
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Computation Time for J = 2 



Computation Time for J = 5 



Computation Time for J=8 



Fig. 2: Time required for 100 Acceptances on a standard laptop 


Feature Probabilities J=5 Feature Probabilities j=5 

Truncation = 10 Truncation = 20 




Feature Probabilities J = 5 
Truncation=100 



Fig. 3: Mean of empirical feature probabilities, sorted in descending order for varying truncation levels and J = 5 


probability f(Z*) we do not include Z* in c n and stop 
our sampling procedure. 


Alternatively, we can propose smaller changes to the 
current vector Z n . For example, we could propose z' ni = 
1 — z n i, and accept with probability min(l,r) where 


Sampling Z n We have two options for sampling Z n . 
We can propose an entirely new vector Z ', by using 
the ultimate Z* obtained when sampling c n (i.e. the 
proposal Z* that we rejected from c n ) as a Metropolis 
Hastings proposal. Since the proposal is sampled from 
the prior predictive distribution of the R-IBP, we accept 
the proposal with probability 


min 


/ p( Xn \z' n ,e) \ 

\ ’ P(x n \Z n ,G)J 


r _ P{ x n\ z ni) @)P{ z ni I rn ~ z n j t N, ^ n ) Z ni z ni) 

P(Xn\Zni) 0^P(z n i\l7l — Zni , N , ^n) Q^Zni ^ Z ni) 

__ p {Xn\z' ni , z _j, e)ml™ Zni ( 1 - rn]-j£J(z' ni + Ylk^i z nk) 
P{x n \z n ii Z_j, @)TO i; T Zni (1 — m i,—Zi^ Zni z nk) 

Similarly, we could propose changing multiple entries at 
once (necessary if we have, for example, f(x) = 6j(x)), 
or adding/removing new features. 
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Feature Probabilities J = 8 
Truncation = 10 



Feature Index 


Feature Probabilities J=8 
Truncation = 20 



Feature Probabilities J = 8 
Truncation=100 



Fig. 4: Mean of empirical feature probabilities, sorted in descending order for varying truncation levels and J = 8 


6.1.2 Uncollapsed Inference with an Instantiated 
Latent Measure 


If the distribution / over the number of latent features 
per row differs significantly from that implied by the 
IBP, the number of auxiliary features required in a col¬ 
lapsed scheme quickly becomes prohibitive. In practice, 
we found the computational cost of the sampler de¬ 
scribed in Section [6]0] was infeasible in most cases. 

An alternative approach is to alternate sampling the 
latent measure conditioned on the binary matrix, and 
vice versa, mirroring the methods for sampling from the 
prior described in Sections |5.2.1| and |5.2.2| Since we 
cannot represent the entire measure p, we work with 
a finite-dimensional approximation j r = (ffi,..., 717), 
obtained either via a weak limit approximation or via 
truncation in a stick-breaking process. Sections |5.2.2 


and [Oj suggest methods for bounding the resulting er¬ 
ror, or using a dynamic truncation to avoid such error. 


Sampling Z17r If the distribution / is not degenerate 
on a single point, we can use the inclusion probabilities 
described in Section |5.2.2[ combined with / and the 
data likelihood P{X\Z 1 (9), to Gibbs sample the value 
of a single entry, using the conditional probabilities 


If the distribution / is degenerate on a single value 
J, we cannot construct a Gibbs sampler that sequen¬ 
tially turns elements on or off; doing so would change 
the number of features. Instead, we can use the appro¬ 
priate inclusion probabilities to sample the location of 
each of the non-zero elements in a row, conditioned on 
the other J — 1 locations. Let £ n k be the location of the 
jth non-zero entry. Then 


P{inj =i\{TTl,...,ni},£-nj,X 1 0) 

~tl — k n -i-1. 


(X TTi 


S n 


({fife : z nk = 0, k * *}) 


S( kn ' ’({fr k : z nk = 0 or k = i}) 


(19) 


f(k ni ,i + l)P(X\£ nj = i,^ nj ,0) 


The Gibbs sampling steps described in Equations [18| 
and[l9]only change a single element of Z at a time. This 
can lead to slow mixing. We can augment these Gibbs 
sampling steps with Metropolis Hastings proposals gen¬ 
erated from the prior, using either the rejection sam¬ 


pling approach of Section 5.2.1 or the inclusion prob¬ 


ability approach of Section 5.2.2 to propose an entire 
row of the binary matrix. 


P ( Zni — l|{7Tl, • • • , 7T J }, Z __ n i , 0 ) 

~ S I 0 ~ k,l '~'~ 1 ({Tr k : z nk = 0,k yl i}) 

0^- r_ u 

S 1 ({ 7 tfc : z nk = 0 or k = *}) 

• + l)P(X\z ni = 1, Z_ ni , 0) 

P(Zni = 0|{7Tl, . . . ,7Tj}, Z_ ni , X, 0) 

OC f(k n -i)P(X\z ni = 0, Z_ m , 0), 

where k n -k = Z M- 


Sampling the latent measure Once we have sampled our 
binary matrix Z, we must resample our latent feature 
weights 7T. Unfortunately, since the beta process is not 
conjugate to the restricted Bernoulli process, we can¬ 
not directly Gibbs sample n given Z. Instead, we use 
Metropolis-Hastings steps. Since the posterior distribu¬ 
tion over 7T given Z is likely to be similar to the poster 
distribution in the unrestricted IBP, we use the poste¬ 
rior distribution from the unrestricted IBP as a pro¬ 
posal distribution. The acceptance probability depends 
on the R-IBP likelihood (Equations [4] and [5]) . 
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6.2 Hybrid Variational Inference in the R-IBP 


The standard variational approach for the IBP Doshi 


et ah, 2009 uses a mean-field approximation which 


places independent distributions q(z n i ) over each fea¬ 
ture assignment z nl . Using such a factored distribution 
is straightforward because each assignment z n t is drawn 
independently given the weight n $. However, variational 
inference in the R-IBP is challenging because fixing the 
number of active features J n introduces dependence be¬ 
tween the z n i, and because the implied prior distri¬ 
butions over the marginal inclusion probabilities rjik 
are complex. Further, the invariance of the likelihood 
to scaling the directing measure, as described in Sec¬ 
tion [3T4j can lead to inefficiencies in exploring the state 
space and computational difficulties due to very small 
atom sizes that may occur at certain scales. 

We propose a hybrid variational for inference in 
the R-IBP that combines variational distributions over 
the feature assignments and model parameters with 
MCMC inference over the directing measure. As in Sec¬ 
tion | 6 . 1 . 2 | we work with a finite dimensional approxi¬ 
mation 7 r to the directing measure. We assume that the 
weights 77 are fixed during the variational update, and 
then alternate between resampling the 77 and updat¬ 
ing the variational posterior on the other variables. We 
demonstrate this approach using a linear-Gaussian like¬ 
lihood, where the data X are assumed to be generated 
by ZA + e, where A is an I x D feature matrix with 
independent normal priors Normal(0, a\) on each value 
and e is a N x D matrix of independent noise drawn 
from Normal(0, cr^). We note that the inference of the 
feature matrix A is the same as in the standard IBP, 
and other likelihood models developed for the IBP can 
be substituted. 

Specifically, the variables in the variational update 
are the feature assignments Z , the feature values A , 
and the count of active features per observation J n . We 
consider the following mean field approximation for the 
variational inference: 


— q^Ai) independent Gaussian distributions with 
mean cj>i, variance <Pi on the posterior of the feature 
value vector A t . 

— q Uni (z n i) independent Bernoulli distributions, where 
v n i is the probability that z n i is active. 

— q-y nk (Jn) multinomial distributions over the number 
of features in observation n , where ~/ n k is the prob¬ 
ability that observation n has k active features. 

Let W = {<fi, v, 7 } be the set of variational 
parameters, and let V = {A, Z, J n } be the set of 
variables. Because the actual and variational dis¬ 
tributions belong to the exponential family, coor¬ 
dinate ascent on the variational parameters cor¬ 


responds to setting the variational distribution 
log(gwJ = Ew_i [log(P(JF, V\X, 0))], where 0 denotes 


the set of hyper-parameters {a 2 , cr 2 , a, /} Wainwright 


and Jordan, 2008 


We focus on providing the variational updates for 
the parameters associated with Z and J n , as the up¬ 
dates for the parameters associated with A (i.e., (f> k , <&k) 
are exactly the same as in Doshi et al., 2009] . The up¬ 
date for 7 n k is: 


lo g( 97 „(^n)) =E z [logP(J n )+logP(Z n \n,J n )] 

= J2k=iHJn = fc) [log (fnk) + V n i log(r7»fc) 
+ (1 - I'm) log(l - 7?»fe)] 

where f nk is the prior probability that observation n has 
k elements. Exponentiating and normalizing, we recover 
the posterior parameters 7 n k- 

The update for variational parameters v nl for the 
assignments Z are also straightforward given the inclu¬ 
sion probabilities rjik' 


log {Qvni(Zni)) =Ej nt z. ni ,A^Og(P(z n i\Z_ n i,Tt, J n )) 
+ log(P(X n \Z n ,A,a 2 n ))\ 


( 20 ) 


where the second term is again exactly the same as 
in Doshi et ah, 2009 . For the first term, we can write 


Ej n ,z_ nt [log(P(z ni |Z_ ni , 7 r, J„))] 

=Ej nt Z- ni [ZniI(Jn = k) log(r?ifc) 

+ (1 - Zni)I{J n = k) log(l - T] lk )} 

=ZniY. k lnk l0g(j^) +C. 

Substituting equation [21] into equation [20j we derive 
the update 


? = Efc Ink log(T^) - ( - 2^ + Tr($i) 

+ + 2<(>i(^I 'nj<t>j) 

Vni l+ eX p(—’ 

( 22 ) 

The equations above show how to update the vari¬ 
ational distributions on Z, A, and J n given tt. During 
our inference process, we iterate through the following 
steps: 

1. Computing the partial variational posterior on Z , 
A, and J n . 

2. Sampling values of Z , A, and J n from the variational 
posterior. 

3. Sampling new values of 7 r given the sampled Z. 
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To resample 7r given Z, we use the Metropolis- 
Hastings step described in Section |6.1.2[ where we 
jointly propose a new set of { 77 ^...7r^} from the weak- 
limit approximation to the IBP posterior distribution 
fir' ~ Beta(“ + to;, 1 + N - m*), where mi = Y^ n z n i- 
Next we accept or reject using the beta process prior 
on 7 r and the likelihood P(Z\tt' , J n ), which can be com¬ 
puted using the inclusion probabilities rf ik . 


7 Evaluation 

We show a variety of evaluations to demonstrate the 
value of using the R-IBP on real and synthetic data 
when we have some knowledge about the marginals on 
the number of non-zero entries. 


7.1 Exploration with Synthetic Data 

To explore the ability of the R-IBP to recover la¬ 
tent structure, we generated two datasets using a lin¬ 
ear Gaussian model, and used the IBP, the R-IBP 
with an appropriate restricting distribution, and the 
partially-exchangeable R-IBP with labeling informa¬ 
tion described in Section 14721 to recover the latent struc¬ 
ture. 


7.1.1 Knowledge about the Number of Latent Features 
Assists with Parameter Recovery. 


One reason for using the R-IBP is when we have strong 
ideas of what a “feature” corresponds to, coupled with 
strong information about the number of such features. 
While an IBP may be able to model the data using a 
collection of features, these features may not correspond 
to our preconceived notions of features - for example, 
the IBP might use multiple features where we expect a 
single feature. 

To explore this, we generated a toy dataset with 
a total of 15 latent features. We generated 400 obser¬ 
vations with 14 of the 15 latent features, and 100 ob¬ 
servations with a single latent feature. We assumed a 
user-defined, observation-specific distribution over the 
number of features (corresponding to the partially ex¬ 
changeable model described in Section 4.2). Specifically, 
if an observation X n contains k n features, we used a re¬ 
stricting distribution /„ that is uniform over k n ± 1. 

Figure [5] shows qualitative results on the toy data. 
The first column shows the true features and the true 
distribution on the number of active features in each ob¬ 
servation. Because many of the features occur in many 
of the data sets, the IBP (center column) does not re¬ 
cover the true features, nor does it recover a distribution 


of active features that is close to the true distribution. 
In contrast, the R-IBP (right column) recovers a latent 
structure that is much closer to true parameters. 

7.1.2 Knowledge about the Feature Distribution Assists 
with Predictive Performance 

While interpretable features are desirable, we do not 
want them to come at the expense of predictive perfor¬ 
mance. To evaluate predictive performance, we consid¬ 
ered 500 observations from a one-inflated Poisson model 
in which 80% of the observations have one associated 
latent feature and the remaining 20% have a Poisson- 
distributed number of associated latent features with 
mean A. Such a model might be relevant when model¬ 
ing patients in a typical clinical practice, where most 
patients might have very simple complaints and a few 
patients may have a very complex combination of dis¬ 
eases. We apply the Gibbs sampler for A = {3, 6, 9,12}; 
the concentration parameter for the IBP was set to the 
mean number of features per observation in each set¬ 
ting. 

We explore two variants of the R-IBP: in the 
fully exchangeable version, we know that observations 
come from a mixture distribution but we do not know 
whether the observation is associated with the spike or 
the slab; all observations have the same f n = 0.8<5i + 
0.2Poisson(A). In the partially exchangeable version, we 
know to which mixture component the observation be¬ 
longs. If the observation belongs to the spike, we have 
fn = Si, otherwise we have /„ = Poisson(A) This as¬ 
sumption may be reasonable in many domains; for ex¬ 
ample, it may be easy to tell if a patient has a simple 
or complex condition without knowing explicitly what 
diseases a patient with complex diseases has. 

We randomly held out 1% of the data. Figure [6] 
shows the negative log-likelihoods on the held-out data 
averaged over 5 runs of 500 iterations each (lower is bet¬ 
ter) . When the mean number latent features in the slab 
distribution A = 3, all observations have few features, 
and the R-IBP variants performs slightly worse than 
the IBP - something we attribute to slower mixing and 
therefore slower convergence, due to the lack of conju- 
gacy. However, as the slab mean A increases, the R-IBPs 
variants consistently out-perform the IBP. As expected, 
the partially-exchangeable variant, in which each ob¬ 
servation contains a covariate describing whether it is 
a member of the spike or the slab, does the best. 


7.2 Comparison on Multiple Real Data Sets 

We compare the two inference approaches for the R-IBP 
from sections 16.11 and 16.21 to three IBP baselines. The 
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Histogram of Features per Observation, R-IBP 



Number of Active Features 


(d) True Histogram over active features (e) IBP Histogram over active features (f) R-IBP Histogram over active features 
per observation per observation per observation 


Fig. 5: Examples of features and counts of active features found by the variational inference for the R-IBP and 
the IBP on the toy data. The R-IBP recovers patterns much closer to the true features than the IBP, in which 
observations with just one feature tend to get assigned no features, while observations with many get a few generic 
features corresponding to most dimensions being active. In contrast, the R-IBP recovers a histogram of features 
per observation that is much closer to the true distribution. 


hybrid variational IBP applies the same hybrid varia¬ 
tional approach to inference in the IBP as was devel¬ 
oped for the R-IBP in section [672} We also compare to 


Gibbs sampling in the IBP Griffiths and Ghahramani. 


2011 and the standard variational inference approach 
for the IBP Doshi et al., 2009 . In all cases, we the 


linear Gaussian likelihood model in which the data X 
are assumed to be generated by ZA + e, where A is an 
I x D feature matrix with independent normal priors 
Normal(0, <r^) on each value and e is a N x D matrix 
of independent noise drawn from Normal(0, <r^). Both 
the Gibbs sampler and the variational methods were 
run for 300 iterations. For the hybrid variational meth¬ 
ods, the weights were resampled every 25 iterations of 
the coordinate ascent. All methods were run 5 times. A 
random 1% of the data was held-out for evaluation. 

We compare these methods on several data sets: 


— The chord data set consists of a collection of three- 
note chords and single notes. All 1320 three-note 
permutations and all 12 single notes for the octave 
containing middle C were synthesized into wav files 
using MIDIUtil and FluidSynth; the power spec¬ 
trum of these wav files was evaluated at every 10Hz 


between 0 and 1000Hz, resulting in a dataset with 
100 dimensions. For the R-IBP, we used the partially 
exchangeable version, where we provided informa¬ 
tion stating that single notes had 1 latent feature 
in expectation (fc„ = 1) and chords had 3 latent 
features in expectation (k n = 3). 

— The newsgroup data-set is the sub¬ 
set of the 20 newsgroups data set from 

http://www.cs.nyu.edu/ roweis/data.html, consist¬ 
ing of the counts for the top 100 words for 5000 
documents. We arbitrarily set k n = • where L n 

was the length of the document. 

— The NPR data set consisted of the 365 features and 
the 365 summaries from April 2013 to April 2014^] 
The stories were processed through NLTK clean and 
we kept the 1964 most common words. We provided 
the information that the expected number of topics 
in a features story was k n = 1 while the expected 
number of topics in a summary was k n = 5. 


6 Source: http://www.npr.org/api/queryGenerator.php 
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Toy: Slab Mean 3 Toy: Slab Mean 6 Toy: Slab Mean 9 Toy: Slab Mean 12 



Fig. 6: Negative log-likelihoods (lower is better) for data from a one-inflated Poisson model with the mean of 
the Poisson A = {3,6,9,12}. R-IBP is the fully exchangeable R-IBP model, whereas R-IBP-PE is the partially- 
exchangeable R-IBP model where each observation is associated with a covariate describing from which distribution 
it comes. 


In all cases, the distribution / was set to be uniform 
over k n ± 1. We used a linear Gaussian likelihood in all 
cases. 

Likelihoods and training times for the toy problem 
and other problems are shown in tables [TJ [2] and [3] Here 
we see that the auxiliary information provided by the R- 
IBP also translates into better likelihoods and not just 
qualitatively better parameter recovery. As expected, 
the variational inference also runs significantly faster 
than the MCMC-based approaches; however in some 
of the experiments the variational approach yielded a 
lower quality estimate (shown most clearly in the NPR 
dataset). 


8 Discussion and Future Work 


The Restricted Indian Buffet Process is a useful tool for 
latent feature modeling with a non-Poissonian number 
of latent features per data point. In this article, we have 
expanded on the original exposition [Williamson et a~ 
2013 by providing new representations that connect 


the R-IBP to tilted CRMs and the scaled beta-prime 
process. We also provide several alternatives for exact 
and approximate simulation from the R-IBP, as well as 
new inference algorithms, including a computationally 
efficient variational/MCMC hybrid algorithm. 

While the IBP often has reasonable performance on 
data sets with arbitrary distributions over the num¬ 
ber of features—rather than a Poisson distribution 
we find that additional knowledge about the number of 
features can be very helpful if it is available. In par¬ 
ticular, a common challenge when performing inference 
with the IBP is that it often learns combinations of 
features as a single feature, especially when there are 
correlations between features. While these feature com¬ 


binations may reasonably represent the data, a latent 
variable model that learns such grouped features will 
do poorly if asked to make predictions on observations 
where that correlation is not present. With the R-IBP, 
it is possible to specify the expected number of features 
in an observation, allowing us to discover features with 
both better interpretability and generalization. 

In general, we see the most pronounced differences 
in situations where we had strong prior knowledge 
about the number of features in a dataset—such as 
the chord and toy examples. Differences were less pro¬ 
nounced in data sets such as newsgroups, where we 
made somewhat arbitrary decisions about the poten¬ 
tial number of features based on document lengths; in 
general the IBP is a sufficiently flexible prior to cap¬ 
ture posteriors with relatively small deviations from 
Poisson-distributions on the number of latent features, 
and in this case we actively decreased this flexibility. An 
interesting direction for further research would be try 
to leverage less strong prior information—such as the 
information in the NPR data set where some stories 
are features and some stories are collections of multiple 
news summaries. 

More broadly, while we have focused on the Indian 
Buffet Process, the concepts described in this paper 
are applicable to other nonparametric models such as 
the beta-negative Binomial process or gamma-Poisson 
process. As we discussed in Section [4j the variety of 
possible restrictions is much broader when considering 
non-binary matrices, which are often used for modeling 
count data. It will be interesting to explore where re¬ 
stricted models can be effectively used in this context; 
in principle different restrictions can allow domain ex¬ 
perts to encode a rich number of kinds of prior knowl¬ 
edge. 
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Table 1: Comparison of training set likelihoods for the R-IBP and the IBP. 



Chord 

Newsgroups 

NPR 

Hybrid-Var. 
R-IBP 

-1.25e+05 

(-1.25e+05, -1.25e+05) 

-2.33e+06 

(-2.33e+06, -2.33e+06) 

-1.65e+07 

(-1.66e+07, -1.65e+07) 

Hybrid-Var. 
IBP 

-2.13e+05 

(-2.13e+05, -2.13e+05) 

-2.38e+06 

(-2.38e+06, -2.38e+06) 

-6.49e+06 

(-6.50e+06, -6.48e+06) 

Variational 

IBP 

-2.13e+05 

(-2.13e+05, -2.13e+05) 

-2.39e+06 

(-2.39e+06, -2.39e+06) 

-6.90e+06 

(-6.96e+06, -6.83e+06) 

Gibbs R-IBP 

-1.33e+05 

(-1.33e+05, -1.33e+05) 

-2.34e+06 

(-2.34e+06, -2.34e+06) 

-4.89e+06 

(-4.90e+06, -4.89e+06) 

Gibbs IBP 

-1.25e+05 

(-1.25e+05, -1.25e+05) 

-2.34e+06 

(-2.34e+06, -2.34e+06) 

-5.15e+06 

(-5.16e+06, -5.14e+06) 


Table 2: Comparison of test set likelihoods for the R-IBP and the IBP. 



Chord 

Newsgroups 

NPR 

Hybrid-Var. 
R-IBP 

-2.11e+03 

(-2.13e+03, -2.09e+03) 

-2.38e+04 

(-2.38e+04, -2.37e+04) 

-1.77e+05 

(-1.80e+05, -1.74e+05) 

Hybrid-Var. 
IBP 

-2.12e+03 

(-2.13e+03, -2.11e+03) 

-2.43e+04 

(-2.43e+04, -2.42e+04) 

-7.07e+04 

(-7.14e+04, -7.00e+04) 

Variational 

IBP 

-2.12e+03 

(-2.13e+03, -2.11e+03) 

-2.43e+04 

(-2.43e+04, -2.42e+04) 

-7.28e+04 

(-7.33e+04, -7.22e+04) 

Gibbs R-IBP 

-2.26e+03 

(-2.29e+03, -2.24e+03) 

-2.37e+04 

(-2.37e+04, -2.37e+04) 

-5.46e+04 

(-5.48e+04, -5.44e+04) 

Gibbs IBP 

-2.32e+03 

(-2.34e+03, -2.29e+03) 

-2.37e+04 

(-2.38e+04, -2.37e+04) 

-5.79e+04 

(-5.81e+04, -5.77e+04) 


Table 3: Comparison of running times (in seconds) for the R-IBP and the IBP. 



Chord 

Newsgroups 

NPR 

Hybrid-Var. 
R-IBP 

1.43e+03 

(1.42e+03, 1.44e+03) 

1.56e+05 

(1.55e+05, 1.58e+05) 

1.02e+04 

(1.01e+04, 1.02e+04) 

Hybrid-Var. 
IBP 

9.68e+02 

(9.60e+02, 9.76e+02) 

3.21e+04 

(3.19e+04, 3.23e+04) 

1.65e+04 

(1.63e+04, 1.66e+04) 

Variational 

IBP 

1.05e+03 

(1.04e+03, 1.06e+03) 

3.69e+04 

(3.67e+04, 3.72e+04) 

1.50e+04 

(1.49e+04, 1.50e+04) 

Gibbs R-IBP 

3.56e+03 

(3.52e+03, 3.60e+03) 

2.04e+04 

(1.99e+04, 2.08e+04) 

9.97e+03 

(9.70e+03, 1.02e+04) 

Gibbs IBP 

2.01e+03 

(1.99e+03, 2.02e+03) 

1.33e+04 

(1.31e+04, 1.35e+04) 

7.39e+03 

(7.20e+03, 7.58e+03) 


Finally, there is much to be explored on approaches 
for incorporating the kinds of observation-specific re¬ 
strictions described in this work. The R-IBP has a nat¬ 
ural interpretation as an IBP with arbitrary distribu¬ 
tions on the number of features in each observation. 
However, as we discussed in Section [3~Tj there is an ex¬ 
tra degree of freedom when we specify the Restricted 
IBP with a beta process or a beta-prime process. Intu¬ 
itively, this invariance arises because conditioned on the 
number of latent features in an observation, the scale of 
the weights no longer matters. Any restriction that can 
be viewed as conditioning will result in this property. In 
theory, working with a normalized beta-prime process 
would remove this invariance; in practice, working with 
a normalized beta-prime process is intractable. 


However, there do exist other tractable normal¬ 
ized random measures James et al., 2009 such as the 


Dirichlet process and other and nonparametric proba¬ 


bility measures such as the Pitman-Yor process Pitman 


and Yor, 1997 . These measures could be substituted 


for the beta-prime process in Equation [8] The result¬ 
ing model could no longer be interpreted as a restricted 
version of the IBP, but it is nonetheless a valid model 
that may have very similar properties. Having a more 
potentially more tractable directing measure may assist 
in developing robust and scalable inference techniques 
for restricted models. 
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